Utilizing ConvNeXt and Transformer Encoders to Improve Results
This motivates the search for a unified deep learning model that can learn all components jointly from raw image input.
| Component | # Params | Trainable @ Start? |
|---|---|---|
| ConvNeXt-B backbone | 88 M | ❌ frozen |
| Transformer (4 layers) | 17 M | ❌ frozen |
| Square tokens | 64 × 1024 ≈ 66 k | ✅ |
| Linear head | 13 k | ✅ |
| Total | 105 M | 79 k (0.07 %) |
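The trainable budget in the table can be checked with a few lines of arithmetic. This is only a sketch: the hidden size of 1024 and the 64 square tokens come from the table, while the 13 output classes (six white pieces, six black pieces, and empty) are an assumption that happens to reproduce the 13 k linear head.

```python
# Frozen parameter counts, taken from the table above.
FROZEN = {"convnext_b_backbone": 88_000_000, "transformer_4_layers": 17_000_000}

HIDDEN_DIM = 1024    # from the table: square tokens are 64 x 1024
NUM_SQUARES = 64
NUM_CLASSES = 13     # assumption: 6 white + 6 black piece types + empty

# Trainable at the start: the learned square tokens and the linear head.
square_tokens = NUM_SQUARES * HIDDEN_DIM               # 65,536
linear_head = HIDDEN_DIM * NUM_CLASSES + NUM_CLASSES   # weights + biases

trainable = square_tokens + linear_head
total = sum(FROZEN.values()) + trainable
trainable_pct = 100 * trainable / total

print(f"trainable: {trainable:,} of {total:,} ({trainable_pct:.3f} %)")
# prints: trainable: 78,861 of 105,078,861 (0.075 %)
```

Under these assumptions, fewer than one parameter in a thousand is updated during warm-up, which is what keeps the first epochs cheap and stable.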
| Epoch(s) | CNN Blocks Unfrozen | Transformer Layers Unfrozen | What’s Happening |
|---|---|---|---|
| 0 – 1 | 0 / 12 | 0 / 4 | Warm-up: only square tokens + linear head learn |
| 2 | 0 / 12 | last 1 / 4 | Begin adapting highest-level Transformer layer |
| 3 – 14 | last 2 / 12 | last 1 / 4 | Fine-tune high-level ConvNeXt blocks in tandem with Transformer |
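The schedule above can be written as a small lookup that a training loop consults at the start of each epoch. The sketch below is framework-agnostic and only encodes the table; the function names `unfreeze_plan` and `trainable_indices` are hypothetical, and treating every epoch from 3 onward like epochs 3 to 14 is an assumption.

```python
from typing import List, Tuple

NUM_CNN_BLOCKS = 12   # from the schedule table
NUM_TX_LAYERS = 4     # from the schedule table

def unfreeze_plan(epoch: int) -> Tuple[int, int]:
    """Return (cnn_blocks_unfrozen, tx_layers_unfrozen), counted from the top."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    if epoch <= 1:
        return 0, 0   # warm-up: only square tokens and linear head train
    if epoch == 2:
        return 0, 1   # adapt the last Transformer layer
    return 2, 1       # also fine-tune the last two ConvNeXt blocks

def trainable_indices(num_modules: int, num_unfrozen: int) -> List[int]:
    """Indices of the last `num_unfrozen` modules out of `num_modules`."""
    return list(range(num_modules - num_unfrozen, num_modules))

for epoch in (0, 2, 3):
    cnn, tx = unfreeze_plan(epoch)
    print(epoch,
          trainable_indices(NUM_CNN_BLOCKS, cnn),
          trainable_indices(NUM_TX_LAYERS, tx))
# prints:
# 0 [] []
# 2 [] [3]
# 3 [10, 11] [3]
```

In a PyTorch training loop, the returned indices would typically be used to set `requires_grad = True` on the parameters of the selected blocks, leaving everything else frozen.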
| Piece | Instances (train) |
|---|---|
| Pawn | 70 k |
| Queen | 4 k |
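To put the imbalance in numbers, the snippet below computes the pawn-to-queen ratio and a set of inverse-frequency class weights. The write-up does not say whether any re-weighting was actually used, so the "balanced" weighting here is purely an illustrative assumption, and only the two classes shown in the table are included.

```python
# Training-set instance counts from the table above (only these two rows are shown).
counts = {"pawn": 70_000, "queen": 4_000}

total = sum(counts.values())
num_classes = len(counts)

# "Balanced" weighting (an assumption, not the author's stated method):
# weight_c = total / (num_classes * count_c), so rarer classes get larger weights.
weights = {name: total / (num_classes * n) for name, n in counts.items()}

imbalance_ratio = counts["pawn"] / counts["queen"]
print(f"pawns outnumber queens {imbalance_ratio:.1f}x")
for name, w in weights.items():
    print(f"{name}: weight {w:.3f}")
# prints:
# pawns outnumber queens 17.5x
# pawn: weight 0.529
# queen: weight 9.250
```

Weights like these could be passed to a weighted cross-entropy loss so that a misclassified queen costs more than a misclassified pawn.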
| Metric | Baseline ResNeXt (2023) | ConvNeXt-T (+Tx) (Ours) |
|---|---|---|
| Mean incorrect squares / board | 3.40 | 4.33 |
| Boards with no mistakes (%) | 15.26 | 9.12 |
| Boards with ≤ 1 mistake (%) | 25.92 | 19.38 |
| Per-square error rate (%) | 5.31 | 5.94 |
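For reference, here is one way the four metrics could be computed from per-square predictions. The definitions are assumptions, since the write-up does not spell them out. In particular, per-square error rate is taken as total wrong squares divided by total squares, which is consistent with the baseline column (3.40 / 64 ≈ 5.31 %). The function name `board_metrics` is hypothetical.

```python
from typing import Dict, List, Sequence

def board_metrics(pred: Sequence[Sequence[int]],
                  true: Sequence[Sequence[int]]) -> Dict[str, float]:
    """Compute the four table metrics from per-square class labels (64 per board)."""
    if len(pred) != len(true) or not pred:
        raise ValueError("need the same non-zero number of boards")
    errors: List[int] = []
    for p, t in zip(pred, true):
        if len(p) != 64 or len(t) != 64:
            raise ValueError("each board must have 64 squares")
        errors.append(sum(a != b for a, b in zip(p, t)))
    n = len(errors)
    return {
        "mean_incorrect_squares": sum(errors) / n,
        "pct_no_mistakes": 100 * sum(e == 0 for e in errors) / n,
        "pct_at_most_one_mistake": 100 * sum(e <= 1 for e in errors) / n,
        "per_square_error_pct": 100 * sum(errors) / (64 * n),
    }

# Toy data: four empty boards, with one error on board 1 and three on board 2.
truth = [[0] * 64 for _ in range(4)]
preds = [row[:] for row in truth]
preds[1][0] = 1
preds[2][0] = preds[2][1] = preds[2][2] = 1

metrics = board_metrics(preds, truth)
print(metrics)
# mean_incorrect_squares 1.0, pct_no_mistakes 50.0,
# pct_at_most_one_mistake 75.0, per_square_error_pct 1.5625
```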
Annotation‑free training. The model learns board geometry from scratch: no corner clicks, homography, or bounding‑box labels are required. Each image needs only its FEN string.
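Using the FEN string as the only label means each image's target is just 64 class indices, one per square. A minimal sketch of that conversion follows. The class numbering (0 for empty, 1 to 12 for pieces) and the function name `fen_to_labels` are hypothetical choices for illustration, not necessarily the ones used in the project.

```python
from typing import List

# Hypothetical class-index convention: 0 = empty, then white and black pieces.
PIECES = "PNBRQKpnbrqk"
CLASS_OF = {ch: i + 1 for i, ch in enumerate(PIECES)}
EMPTY = 0

def fen_to_labels(fen: str) -> List[int]:
    """Map a FEN string to 64 class indices, ordered a8..h8, a7..h7, ..., a1..h1."""
    placement = fen.split()[0]   # ignore side to move, castling rights, etc.
    ranks = placement.split("/")
    if len(ranks) != 8:
        raise ValueError("FEN piece placement must have 8 ranks")
    labels: List[int] = []
    for rank in ranks:
        row: List[int] = []
        for ch in rank:
            if ch.isdigit():
                row.extend([EMPTY] * int(ch))   # a digit is a run of empty squares
            elif ch in CLASS_OF:
                row.append(CLASS_OF[ch])
            else:
                raise ValueError(f"unexpected FEN character: {ch!r}")
        if len(row) != 8:
            raise ValueError(f"rank {rank!r} does not describe 8 squares")
        labels.extend(row)
    return labels

start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
labels = fen_to_labels(start)
print(len(labels), labels[:8])
# prints: 64 [10, 8, 9, 11, 12, 9, 8, 10]
```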
Entire ChessReD leveraged (10 800 photos). By discarding the bounding‑box requirement, we expand the training data roughly 5× compared with prior work that used only the 2 k labelled subset, capturing far more variation in angle, lighting, and occlusion.
Self‑attention acts as an implicit board detector. Square‑token attention heads automatically focus on rank/file edges and coordinate markings, letting the network infer orientation and resolve piece‑type ambiguities.
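To make the mechanism concrete, here is a toy sketch of scaled dot-product attention over a joint sequence of square tokens and image patches. It is deliberately simplified: it uses random vectors, a single head, no learned query/key/value projections, and toy sizes (`DIM = 8`, 49 patches) that are assumptions unrelated to the real model.

```python
import math
import random
from typing import List

Vector = List[float]

def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attend(queries: List[Vector], keys: List[Vector]) -> List[Vector]:
    """Scaled dot-product attention weights: one row per query, one column per key."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        for q in queries
    ]

random.seed(0)
DIM, NUM_SQUARES, NUM_PATCHES = 8, 64, 49   # toy sizes, not the real model's

square_tokens = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_SQUARES)]
patch_features = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]

# Self-attention over the joint sequence: square tokens followed by image patches.
sequence = square_tokens + patch_features
weights = attend(sequence, sequence)

# For square token 0, find which image patch it attends to most strongly.
row = weights[0][NUM_SQUARES:]   # attention from token 0 to the patches only
top_patch = max(range(NUM_PATCHES), key=row.__getitem__)
print(f"square token 0 attends most to patch {top_patch}")
```

In a trained model, inspecting rows like `weights[0]` is one way to see whether a square token has learned to look at its own region of the board and at orientation cues such as coordinate markings.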
Gradual unfreezing preserves ImageNet priors. Keeping ConvNeXt frozen for the first three epochs guards against catastrophic forgetting; after that, only the top two blocks are fine‑tuned alongside the Transformer, adding domain specificity while limiting over‑fitting.
For the Streamlit application, it is very important to load data into the prediction pipeline in exactly the same way as during training.
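One simple way to enforce this is to keep preprocessing in a single function that both the training code and the app import. The sketch below normalises raw RGB pixels. The mean and standard deviation are the standard ImageNet statistics, and their use here is an assumption (plausible for an ImageNet-pretrained backbone, but not confirmed by the write-up). The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumed values: the standard ImageNet channel statistics (RGB order).
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)

def preprocess_pixel(rgb: Sequence[int]) -> Tuple[float, ...]:
    """Scale an 8-bit RGB pixel to [0, 1], then standardise each channel."""
    if len(rgb) != 3:
        raise ValueError("expected an (R, G, B) triple")
    return tuple((v / 255.0 - m) / s for v, m, s in zip(rgb, MEAN, STD))

def preprocess_image(pixels: Sequence[Sequence[int]]) -> List[Tuple[float, ...]]:
    """Apply the same per-pixel normalisation to every pixel of an image."""
    return [preprocess_pixel(p) for p in pixels]

# Training and the app both call the same function, so the inputs cannot drift apart.
train_input = preprocess_image([(255, 128, 0)])
app_input = preprocess_image([(255, 128, 0)])
print(train_input == app_input)
# prints: True
```

Typical mismatches to watch for are a different resize or crop, a swapped channel order (BGR instead of RGB), and skipping normalisation in the app.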
Real‑world chess recognition is hard, but moving from brittle pipelines to end‑to‑end learning, backed by the new ChessReD dataset, is the way forward.
DSAN 6500 End-to-End Chess Recognition Presentation